Approximate F_2-Sketching of Valuation Functions

Authors: Yaroslavtsev Grigory; Zhou Samson
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)
Publication date: 01/01/2019

We study the problem of constructing a linear sketch of minimum dimension that allows approximation of a given real-valued function f : F_2^n -> R with small expected squared error. We develop a general theory of linear sketching for such functions through which we analyze their dimension for the most commonly studied types of valuation functions: additive, budget-additive, coverage, alpha-Lipschitz submodular and matroid rank functions. This gives a characterization of how many bits of information have to be stored about the input x so that one can compute f under additive updates to its coordinates. Our results are tight in most cases and we also give extensions to the distributional version of the problem where the input x in F_2^n is generated uniformly at random. Using known connections with dynamic streaming algorithms, both upper and lower bounds on dimension obtained in our work extend to the space complexity of algorithms evaluating f(x) under long sequences of additive updates to the input x presented as a stream. Similar results hold for simultaneous communication in a distributed setting.
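To make the notion of a linear sketch under additive updates concrete, the following is a minimal sketch in Python. It is not the construction from the paper: the random matrix, the function names, and the bitmask encoding of x are all assumptions chosen for illustration. It only shows the property the abstract relies on, namely that a sketch Mx over F_2 can be maintained when a coordinate of x is flipped, without storing x itself.

```python
import random

def make_sketch_matrix(k, n, seed=0):
    """Hypothetical helper: a random k x n matrix over F_2.
    Each row is stored as an n-bit integer mask."""
    rng = random.Random(seed)
    return [rng.getrandbits(n) for _ in range(k)]

def sketch(rows, x):
    """Return Mx over F_2; entry j is the parity of (rows[j] AND x)."""
    return [bin(r & x).count("1") % 2 for r in rows]

def update(rows, s, i):
    """Apply the additive update x <- x + e_i (a bit flip over F_2)
    by XOR-ing column i of M into the current sketch s."""
    return [bit ^ ((r >> i) & 1) for bit, r in zip(s, rows)]

rows = make_sketch_matrix(k=4, n=16)
x = 0b1011
s = sketch(rows, x)
# Updating the sketch agrees with re-sketching the updated input.
assert update(rows, s, 5) == sketch(rows, x ^ (1 << 5))
```

Only the k sketch bits are stored between updates; how small k can be for a given valuation function f is the question the paper studies.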

Dagstuhl Research Online Publication Server

Approximating Cumulative Pebbling Cost Is Unique Games Hard

Authors: Blocki Jeremiah; Lee Seunghoon; Zhou Samson
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 11th Innovations in Theoretical Computer Science Conference (ITCS 2020)
Publication date: 15/11/2019

The cumulative pebbling complexity of a directed acyclic graph $G$ is defined as $\mathsf{cc}(G) = \min_P \sum_i |P_i|$, where the minimum is taken over all legal (parallel) black pebblings of $G$ and $|P_i|$ denotes the number of pebbles on the graph during round $i$. Intuitively, $\mathsf{cc}(G)$ captures the amortized Space-Time complexity of pebbling $m$ copies of $G$ in parallel. The cumulative pebbling complexity of a graph $G$ is of particular interest in the field of cryptography as $\mathsf{cc}(G)$ is tightly related to the amortized Area-Time complexity of the Data-Independent Memory-Hard Function (iMHF) $f_{G,H}$ [AS15] defined using a constant indegree directed acyclic graph (DAG) $G$ and a random oracle $H(\cdot)$. A secure iMHF should have amortized Space-Time complexity as high as possible, e.g., to deter a brute-force password attacker who wants to find $x$ such that $f_{G,H}(x) = h$. Thus, to analyze the (in)security of a candidate iMHF $f_{G,H}$, it is crucial to estimate the value $\mathsf{cc}(G)$, but currently, upper and lower bounds for leading iMHF candidates differ by several orders of magnitude. Blocki and Zhou recently showed that it is $\mathsf{NP}$-Hard to compute $\mathsf{cc}(G)$, but their techniques do not even rule out an efficient $(1+\varepsilon)$-approximation algorithm for any constant $\varepsilon>0$. We show that for any constant $c > 0$, it is Unique Games hard to approximate $\mathsf{cc}(G)$ to within a factor of $c$. (See the paper for the full abstract.)

Comment: 28 pages, updated figures and corrected typo
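As an illustration of the quantity $\sum_i |P_i|$, the Python sketch below checks one given pebbling and returns its cost. It does not compute $\mathsf{cc}(G)$, which requires minimizing over all legal pebblings. The legality rules used here (start from an empty graph, place a pebble only when all parents were pebbled in the previous round, pebble every sink at some point) are the usual ones for parallel black pebbling and are an assumption, since the abstract does not spell them out. The function name and input encoding are hypothetical.

```python
def pebbling_cost(parents, sinks, pebbling):
    """Return sum_i |P_i| for a legal parallel black pebbling.

    parents:  dict mapping a node to the list of its parents (sources may be omitted)
    sinks:    nodes that must be pebbled at some round
    pebbling: list of sets P_1, P_2, ... (P_0 is taken to be empty)
    Raises ValueError if the pebbling is not legal under the assumed rules.
    """
    prev, seen, total = set(), set(), 0
    for i, cur in enumerate(pebbling, start=1):
        cur = set(cur)
        for v in cur - prev:  # newly placed pebbles
            if not set(parents.get(v, ())) <= prev:
                raise ValueError(f"round {i}: node {v} pebbled before its parents")
        seen |= cur
        total += len(cur)
        prev = cur
    if not set(sinks) <= seen:
        raise ValueError("some sink is never pebbled")
    return total

# Path graph 1 -> 2 -> 3 with sink 3.
path = {2: [1], 3: [2]}
print(pebbling_cost(path, [3], [{1}, {2}, {3}]))            # cost 3
print(pebbling_cost(path, [3], [{1}, {1, 2}, {1, 2, 3}]))   # cost 6
```

The cost of any single legal pebbling is an upper bound on $\mathsf{cc}(G)$; the two pebblings of the path above show how discarding pebbles early lowers the cumulative cost.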

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Computationally Data-Independent Memory Hard Functions

Authors: Ameri Mohammad Hassan; Blocki Jeremiah; Zhou Samson
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 11th Innovations in Theoretical Computer Science Conference (ITCS 2020)
Publication date: 15/11/2019

Memory hard functions (MHFs) are an important cryptographic primitive that are used to design egalitarian proofs of work and in the construction of moderately expensive key-derivation functions resistant to brute-force attacks. Broadly speaking, MHFs can be divided into two categories: data-dependent memory hard functions (dMHFs) and data-independent memory hard functions (iMHFs). iMHFs are resistant to certain side-channel attacks as the memory access pattern induced by the honest evaluation algorithm is independent of the potentially sensitive input, e.g., a password. While dMHFs are potentially vulnerable to side-channel attacks (the induced memory access pattern might leak useful information to a brute-force attacker), they can achieve higher cumulative memory complexity (CMC) than an iMHF. In particular, any iMHF that can be evaluated in N steps on a sequential machine has CMC at most O((N^2 log log N)/log N). By contrast, the dMHF scrypt achieves maximal CMC Ω(N^2), though the CMC of scrypt would be reduced to just O(N) after a side-channel attack. In this paper, we introduce the notion of computationally data-independent memory hard functions (ciMHFs). Intuitively, we require that the memory access pattern induced by the (randomized) ciMHF evaluation algorithm appears to be independent from the standpoint of a computationally bounded eavesdropping attacker, even if the attacker selects the initial input. We then ask whether it is possible to circumvent the known upper bound for iMHFs and build a ciMHF with CMC Ω(N^2). Surprisingly, we answer the question in the affirmative when the ciMHF evaluation algorithm is executed on a two-tiered memory architecture (RAM/Cache). We introduce the notion of a k-restricted dynamic graph to quantify the continuum between unrestricted dMHFs (k=n) and iMHFs (k=1). For any ε > 0 we show how to construct a k-restricted dynamic graph with k=Ω(N^(1-ε)) that provably achieves maximum cumulative pebbling cost Ω(N^2). We can use k-restricted dynamic graphs to build a ciMHF provided that cache is large enough to hold k hash outputs and the dynamic graph satisfies a certain property that we call "amenable to shuffling". In particular, we prove that the induced memory access pattern is indistinguishable to a polynomial time attacker who can monitor the locations of read/write requests to RAM, but not cache. We also show that when k=o(N^(1/log log N)) then any k-restricted graph with constant indegree has cumulative pebbling cost o(N^2). Our results almost completely characterize the spectrum of k-restricted dynamic graphs.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Improved Algorithms for Time Decay Streams

Authors: Braverman Vladimir; Lang Harry; Ullah Enayat; Zhou Samson
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)
Publication date: 01/01/2019

In the time-decay model for data streams, elements of an underlying data set arrive sequentially with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline-coreset and gives a time-decay coreset for polynomial time decay functions. We also consider the exponential time decay model for k-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores O(k log(h Delta)+h) points where h is the half-life of the decay function and Delta is the aspect ratio of the dataset. Our techniques extend to k-means clustering and M-estimators as well

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Nearly Optimal Sparse Group Testing

Authors: Gandikota Venkata; Grigorescu Elena; Jaggi Sidharth; Zhou Samson
Publication date: 19/09/2018

Group testing is the process of pooling arbitrary subsets from a set of $n$ items so as to identify, with a minimal number of tests, a "small" subset of $d$ defective items. In "classical" non-adaptive group testing, it is known that when $d$ is substantially smaller than $n$, $\Theta(d\log(n))$ tests are both information-theoretically necessary and sufficient to guarantee recovery with high probability. Group testing schemes in the literature meeting this bound require most items to be tested $\Omega(\log(n))$ times, and most tests to incorporate $\Omega(n/d)$ items. Motivated by physical considerations, we study group testing models in which the testing procedure is constrained to be "sparse". Specifically, we consider (separately) scenarios in which (a) items are finitely divisible and hence may participate in at most $\gamma \in o(\log(n))$ tests; or (b) tests are size-constrained to pool no more than $\rho \in o(n/d)$ items per test. For both scenarios we provide information-theoretic lower bounds on the number of tests required to guarantee high probability recovery. In both scenarios we provide both randomized constructions (under both $\epsilon$-error and zero-error reconstruction guarantees) and explicit constructions of designs with computationally efficient reconstruction algorithms that require a number of tests that are optimal up to constant or small polynomial factors in some regimes of $n, d, \gamma,$ and $\rho$. The randomized design/reconstruction algorithm in the $\rho$-sized test scenario is universal -- independent of the value of $d$, as long as $\rho \in o(n/d)$. We also investigate the effect of unreliability/noise in test outcomes. For the full abstract, please see the full text PDF
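The two sparsity constraints can be illustrated with a small Python sketch. It is not one of the paper's designs or reconstruction algorithms: it only evaluates a given pooling design, assuming noiseless outcomes, and reports the two quantities that scenarios (a) and (b) bound by $\gamma$ and $\rho$. All function names and the list-of-pools encoding are hypothetical.

```python
def run_tests(pools, defectives):
    """Noiseless outcomes: a test is positive iff its pool contains a defective item."""
    defectives = set(defectives)
    return [bool(defectives & set(pool)) for pool in pools]

def max_tests_per_item(pools):
    """Largest number of tests any single item takes part in (bounded by gamma in scenario (a))."""
    counts = {}
    for pool in pools:
        for item in set(pool):
            counts[item] = counts.get(item, 0) + 1
    return max(counts.values(), default=0)

def max_items_per_test(pools):
    """Largest pool size (bounded by rho in scenario (b))."""
    return max((len(set(pool)) for pool in pools), default=0)

pools = [[0, 1], [2, 3], [0, 2]]      # three tests over items 0..3
print(run_tests(pools, {2}))          # [False, True, True]
print(max_tests_per_item(pools))      # 2
print(max_items_per_test(pools))      # 2
```

In this toy design the outcome pattern already identifies item 2 as the only defective consistent with all three tests, provided exactly one item is defective.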

arXiv.org e-Print Archive

Explore Bristol Research

Approximating Properties of Data Streams

Author: Zhou Samson
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2018

In this dissertation, we present algorithms that approximate properties in the data stream model, where elements of an underlying data set arrive sequentially, but algorithms must use space sublinear in the size of the underlying data set. We first study the problem of finding all k-periods of a length-n string S, presented as a data stream. S is said to have k-period p if its prefix of length n − p differs from its suffix of length n − p in at most k locations. We give algorithms to compute the k-periods of a string S using poly(k, log n) bits of space and we complement these results with comparable lower bounds. We then study the problem of identifying a longest substring of strings S and T of length n that forms a d-near-alignment under the edit distance, in the simultaneous streaming model. In this model, symbols of strings S and T are streamed at the same time and form a d-near-alignment if the distance between them in some given metric is at most d. We give several algorithms, including an exact one-pass algorithm that uses O(d^2 + d log n) bits of space. We then consider the distinct elements and ℓ_p-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and ℓ_p-heavy hitters that are nearly optimal in both n and ε. Finally, we consider the problem of estimating the maximum weighted matching of a graph whose edges are revealed in a streaming fashion. We develop a reduction from the maximum weighted matching problem to the maximum cardinality matching problem that only doubles the approximation factor of a streaming algorithm developed for the maximum cardinality matching problem. As an application, we obtain an estimator for the weight of a maximum weighted matching in bounded-arboricity graphs and in particular, a (48 + ε)-approximation estimator for the weight of a maximum weighted matching in planar graphs.
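The definition of a k-period can be checked directly with a short offline Python sketch. This is only an illustration of the definition: it stores the whole string and so uses linear space, unlike the poly(k, log n)-space streaming algorithms of the dissertation. The function names, and the choice to report periods p in the range 1 to n − 1, are assumptions.

```python
def has_k_period(s, p, k):
    """True if the length-(n - p) prefix and suffix of s differ in at most k positions."""
    n = len(s)
    mismatches = sum(1 for a, b in zip(s[:n - p], s[p:]) if a != b)
    return mismatches <= k

def k_periods(s, k):
    """All p in 1..n-1 such that s has k-period p (brute force, offline)."""
    return [p for p in range(1, len(s)) if has_k_period(s, p, k)]

print(has_k_period("abcabcabd", 3, 0))  # False: "abcabc" vs "abcabd" differ once
print(has_k_period("abcabcabd", 3, 1))  # True
print(k_periods("abab", 0))             # [2]
```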

Purdue E-Pubs

New Frameworks for Offline and Streaming Coreset Constructions

Authors: Braverman Vladimir; Feldman Dan; Lang Harry; Statman Adiel; Zhou Samson
Publication date: 18/09/2022

A coreset for a set of points is a small subset of weighted points that approximately preserves important properties of the original set. Specifically, if $P$ is a set of points, $Q$ is a set of queries, and $f:P\times Q\to\mathbb{R}$ is a cost function, then a set $S\subseteq P$ with weights $w:P\to[0,\infty)$ is an $\epsilon$-coreset for some parameter $\epsilon>0$ if $\sum_{s\in S}w(s)f(s,q)$ is a $(1+\epsilon)$ multiplicative approximation to $\sum_{p\in P}f(p,q)$ for all $q\in Q$. Coresets are used to solve fundamental problems in machine learning under various big data models of computation. Many of the suggested coresets in the recent decade used, or could have used, a general framework for constructing coresets whose size depends quadratically on what is known as total sensitivity $t$. In this paper we improve this bound from $O(t^2)$ to $O(t\log t)$. Thus our results imply more space efficient solutions to a number of problems, including projective clustering, $k$-line clustering, and subspace approximation. Moreover, we generalize the notion of sensitivity sampling for sup-sampling that supports non-multiplicative approximations, negative cost functions and more. The main technical result is a generic reduction to the sample complexity of learning a class of functions with bounded VC dimension. We show that obtaining an $(\nu,\alpha)$-sample for this class of functions with appropriate parameters $\nu$ and $\alpha$ suffices to achieve space efficient $\epsilon$-coresets. Our result implies more efficient coreset constructions for a number of interesting problems in machine learning; we show applications to $k$-median/$k$-means, $k$-line clustering, $j$-subspace approximation, and the integer $(j,k)$-projective clustering problem
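The coreset condition can be made concrete with a small Python checker over a finite query set. This is only a verification sketch, not a coreset construction from the paper. Two assumptions are made: the cost function is nonnegative, and "$(1+\epsilon)$ multiplicative approximation" is read as the weighted sum lying between $(1-\epsilon)$ and $(1+\epsilon)$ times the full sum. The function name and the dictionary encoding of the weights are hypothetical.

```python
def is_eps_coreset(points, subset_weights, queries, cost, eps):
    """Check the coreset condition for every query in a finite set.

    points:         the full point set P (a list; repeats allowed)
    subset_weights: dict mapping each chosen point s in S to its weight w(s)
    cost:           nonnegative cost function f(p, q) (assumption)
    """
    for q in queries:
        full = sum(cost(p, q) for p in points)
        approx = sum(w * cost(s, q) for s, w in subset_weights.items())
        if not (1 - eps) * full <= approx <= (1 + eps) * full:
            return False
    return True

# Toy 1-mean instance on the line: squared distance to a candidate center q.
sq_dist = lambda p, q: (p - q) ** 2
P = [0.0, 0.0, 2.0, 2.0]
Q = [-1.0, 0.5, 3.0]
print(is_eps_coreset(P, {0.0: 2.0, 2.0: 2.0}, Q, sq_dist, eps=0.1))  # True
print(is_eps_coreset(P, {0.0: 4.0}, Q, sq_dist, eps=0.1))            # False
```

Here the first weighted subset reproduces the full cost exactly because each distinct point carries its multiplicity as weight, while the second ignores the points at 2.0 and fails already at q = -1.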

arXiv.org e-Print Archive